# Intro to Thompson Sampling
Thompson Sampling is a reinforcement learning method for choosing between options with unknown reward probabilities: it balances exploring uncertain options with exploiting the best option found so far.
# Multi-Armed Bandit
This method can be used to solve the following problem: imagine you are at a casino facing a row of slot machines, and you know the machines have different probabilities of paying out. You want to find the one with the highest win probability in as few rounds as possible.
Let's assume that one round costs €1 and you have €1,000 to spend; in other words, you will play 1,000 rounds.
# Beta Distribution
Thompson Sampling uses the beta distribution (whose normalizing constant is the beta function). Its probability density is defined as follows:

$$f(x;\alpha,\beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \qquad x \in [0, 1],$$

where $B(\alpha,\beta)$ is the beta function. You can notice that when $\alpha$ increases, the graph of $f$ shifts to the right (toward 1), and when $\beta$ increases, it shifts to the left (toward 0).
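A quick way to see this shift (a small illustration of mine, not from the original text) is to sample from a few beta distributions and compare the sample means; the mean of $\mathrm{Beta}(\alpha,\beta)$ is $\alpha/(\alpha+\beta)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Larger alpha pushes the mass toward 1; larger beta pushes it toward 0
for a, b in [(2, 2), (8, 2), (2, 8)]:
    samples = rng.beta(a, b, size=100_000)
    print(f"Beta({a}, {b}): sample mean ~ {samples.mean():.3f}")
```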
# Sampling
We can solve the Multi-Armed Bandit problem by using the number of times we won and the number of times we lost on each slot machine as that machine's $\alpha$ and $\beta$ (each plus one, so every machine starts with a uniform distribution before we have any data). In each round we draw one sample from each machine's beta distribution and play the machine whose sample is highest, as sketched below. Machines that win often develop distributions concentrated near high values, so they are chosen more and more, and the optimal machine is found quickly.
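To make the selection rule concrete, here is a minimal sketch of a single round; the win/loss counts below are made up purely for illustration:

```python
import numpy as np

# Hypothetical mid-game state: wins and losses observed on each machine so far
wins = np.array([20, 1, 15, 10, 2])
losses = np.array([110, 40, 100, 85, 45])

# One beta sample per machine, then play the machine with the highest draw
samples = np.random.beta(wins + 1, losses + 1)
print("samples:", samples.round(3), "-> play machine", np.argmax(samples))
```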
# Code
First we create the environment: the true conversion rates of the machines and a matrix `X` that stores the simulated outcome of each round on each machine.

```python
import numpy as np

# True win probability of each slot machine (unknown to the model)
conversionRates = [0.15, 0.04, 0.13, 0.11, 0.05]
N = 10000                  # number of rounds to simulate
d = len(conversionRates)   # number of machines

# X[i][j] == 1 if machine j would pay out in round i
X = np.zeros((N, d))
for i in range(N):
    for j in range(d):
        if np.random.rand() < conversionRates[j]:
            X[i][j] = 1
```
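As an optional sanity check (my addition, not part of the original code), the column means of `X` should be close to `conversionRates`:

```python
# Empirical payout frequency of each machine over the N simulated rounds
print(X.mean(axis=0))
```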
Then we use Thompson Sampling with the beta distribution to train our model. As the win and loss counts grow, each machine's beta distribution shifts left or right; the distribution of the machine with the best conversion rate drifts to the right, so it produces higher samples and gets chosen more and more often.
```python
nPosReward = np.zeros(d)  # wins recorded for each machine
nNegReward = np.zeros(d)  # losses recorded for each machine

for i in range(N):
    selected = 0
    maxRandom = 0
    # Sample each machine's beta distribution and pick the highest draw
    for j in range(d):
        randomBeta = np.random.beta(nPosReward[j] + 1, nNegReward[j] + 1)
        if randomBeta > maxRandom:
            maxRandom = randomBeta
            selected = j
    # Play the selected machine and record the outcome
    if X[i][selected] == 1:
        nPosReward[selected] += 1
    else:
        nNegReward[selected] += 1
```
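As a possible follow-up (again my addition, not from the original), we can continue from the training loop above and inspect what the model learned: how often each machine was played and its estimated win rate.

```python
# Continues from the training loop above
nSelections = nPosReward + nNegReward          # times each machine was played
estimatedRates = nPosReward / np.maximum(nSelections, 1)

for j in range(d):
    print(f"Machine {j}: played {int(nSelections[j])} times, "
          f"estimated rate {estimatedRates[j]:.3f} (true {conversionRates[j]})")
print("Best machine found:", np.argmax(nSelections))
```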